General Approach to Initial Dataset

The following excerpt from Text Mining with R: A Tidy Approach (Julia Silge and David Robinson) summarizes how extended text (in this case, TA comments) can be organized as a tidy data table. All emphases are mine.

We thus define the tidy text format as being a table with one-token-per-row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens.

This one-token-per-row structure is in contrast to the ways text is often stored in current analyses, perhaps as strings or in a document-term matrix. For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph. In the tidytext package, we provide functionality to tokenize by commonly used units of text like these and convert to a one-term-per-row format.

The initial working dataset is an anonymized Excel table with one column of “TA comments” extracted from student lab reports. Each TA comment is stored as a separate row of the table, and is a mixed alphanumeric string of one or more words, numbers, and punctuation. Other columns record unique report, student, and TA IDs; grade assigned to the report; other standardized information about the original report from which the comment was extracted; and the hand-coded subject and structure of each comment. Using tidy format (vs. other common text data formats) better maintains the relationships between individual comments and metadata, which simplifies subsequent analysis.

Pre-Check Initial Dataset

Review the data table in Excel. Check that:

  1. Vocabulary terms used in the coding columns match the criteria in “codebook_for_comments.md”.
  2. Data have been de-identified and recoded using pre-assigned anonymous IDs.
  3. Data table headers match the tabulated list below. The same list is in the file “code_column_names.txt”.
Column Number   Column Name
 1   unique.record
 2   report.id
 3   sort
 4   report.title (empty)
 5   student (Std_nnn)
 6   course
 7   ta (TA_nnn)
 8   lab
 9   tag
10   type.TA
11   grade.TA
12   grading.time
13   Rank
14   hypothesis.ok
15   data.ok
16   citation.ok
17   interpretation.ok
18   organization.ok
19   techflaws.ok
20   writing.ok
21   comments.incorporated
22   ta.comment
23   code.subject
24   code.structure
25   code.locus
26   code.scope
27   code.tone
28   code.notes

Export to CSV file with name formatted as “coded_full_comments_dataset_SemYear.csv”


Import and Post-Check Full Dataset

# Call in tidyverse, other required libraries.
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(tidytext)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor

This block reads the full CSV into a starting dataframe named “base_data”.

#CSV file named "coded_full_comments_dataset_Spring18anon.csv" already has correctly organized text and meta-data. 
#Subsequent analyses call in subsets of "base_data".

base_data <- read_csv(file='data/coded_full_comments_dataset_Spring18anon.csv')
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   sort = col_integer(),
##   course = col_integer(),
##   Rank = col_integer()
## )
## See spec(...) for full column specifications.
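The column specification readr guessed above can be pinned explicitly with `col_types`, so that a drifting file layout fails the import instead of silently changing column classes. A minimal sketch, assuming the same filename and the readr loaded via tidyverse above:

```r
library(readr)

# Pin the specification readr guessed: three integer columns, all else character.
base_data <- read_csv(
  "data/coded_full_comments_dataset_Spring18anon.csv",
  col_types = cols(
    .default = col_character(),
    sort     = col_integer(),
    course   = col_integer(),
    Rank     = col_integer()
  )
)
```

With an explicit spec, readr reports a parsing problem whenever a pinned column no longer matches its declared type.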

These post-import checks ensure data have been properly coded and columns contain valid entries.

#The "course" column should only have entries corresponding to 113, 114, 214, or NA. 
unique(base_data$course)
## [1] 113 114  NA 214
#Entries in the "ta" column should match the anonymous IDs for TAs assigned to the relevant courses in the semester being analyzed. Check that no extra anonymous IDs are present, which would suggest improper coding or de-identification.
unique(base_data$ta)
##  [1] "TA_bc8EVX09" "TA_YiRfsy28" "TA_kLMtS9Cu" "TA_l0zDB5kQ" "TA_fMNrLFgi"
##  [6] "TA_kKK8bPFb" "TA_2R8KStqZ" "TA_TaMqWbpv" "TA_axAQrfTP" "TA_7lL5KRzf"
## [11] "TA_rVWBLI1O" "TA_zoTYZxuS" "TA_0RcV6NvC" "TA_180JS8gb" "TA_T8nPUAZg"
#The "lab" column should only have entries matching the limited keywords in the codebook.
unique(base_data$lab)
## [1] "allocation"     "betta"          "manduca"        "frog"          
## [5] NA               "photosynthesis" "physarum"
#Look at the NUMBER of unique anonymous student IDs. Should be within 5% of (but not over) the combined enrollment of the 3 target courses for the semester.
unique(base_data$student)
##   [1] "Std_drVZ2x1W" "Std_St4XmzW9" "Std_k55eEnUN" "Std_NQE2hEZ9"
##   [5] "Std_0d5y7JDx" "Std_xVZWs2b6" "Std_WWqzJAra" "Std_aiqyipj2"
##   [9] "Std_negduoDf" "Std_Mqax2LC6" "Std_6QiXMN6g" "Std_p1Mcpg8E"
##  [13] "Std_uy1tNt0j" "Std_mq6K5YTw" "Std_lWTDqb7O" "Std_uGT6Rmiv"
##  [17] "Std_Avjg3aUu" "Std_LSB0YOpE" "Std_W703hu6U" "Std_H6n5PxOy"
##  [21] "Std_ECIzVQHD" "Std_Ip0DBkaA" "Std_ru0R5p2a" "Std_SnnOjVX4"
##  [25] "Std_OJEUVw25" "Std_MHlYBahq" "Std_De9Bg9OR" "Std_tFV9F1Ds"
##  [29] "Std_dMO8JZlj" "Std_5SFrllJn" "Std_ds6eqfUl" "Std_XvlxBhPy"
##  [33] "Std_HRS8b6PC" "Std_uHEfrYCr" "Std_bEQNz4NG" "Std_3efLeDKU"
##  [37] "Std_ThPpsyyo" "Std_c2kHhFqV" "Std_ydWUBowA" "Std_Zfn3oav0"
##  [41] "Std_EyScRSMZ" "Std_aW1vqkXu" "Std_JaSfqRxV" "Std_LzY08BzP"
##  [45] "Std_NlQtHp7Q" "Std_krtFXJc2" "Std_Vnxin5cb" "Std_v5dh7rxq"
##  [49] "Std_FdEfDUok" "Std_EtXmAMaE" "Std_o5sd5v6l" "Std_ZkBJYFSU"
##  [53] "Std_nTrseYPd" "Std_vIwnncBr" "Std_oiS1EktC" "Std_DQQKHR5x"
##  [57] "Std_F4yWIcx0" "Std_cw7hjCmF" "Std_24AzIkua" "Std_dvDlQ0vv"
##  [61] "Std_ERmhon5S" "Std_P59ECO6v" "Std_65ITLnai" "Std_ARZ2FzDG"
##  [65] "Std_NNe5WfpA" "Std_sSAj9YKg" "Std_rCbbcHSV" "Std_Y5ZgWWSU"
##  [69] "Std_I54L6pql" "Std_QnFSxmzz" "Std_fV51amd3" "Std_lP7MDo7i"
##  [73] "Std_J988mJ8o" "Std_RWwPJZEq" "Std_3ZCbXWf5" "Std_mzqfmeUf"
##  [77] "Std_E9YN4Mk4" "Std_Cl1WKGJ1" "Std_syhpWMOJ" "Std_KqqpKDxS"
##  [81] "Std_0tmbzXi9" "Std_mCobkdHA" "Std_lMDBoTdF" "Std_0R5jFmbi"
##  [85] "Std_LnJ9T4X0" "Std_zVfMRJ2x" "Std_VN99AjaM" "Std_R0Gy2FQB"
##  [89] "Std_bA2xY2Lq" "Std_JNJu7xJ7" "Std_Nu38Huyw" "Std_HxQqL9ro"
##  [93] "Std_iLyF7kub" "Std_Ciwk8IUc" "Std_QVoQBUNa" "Std_YMJL3try"
##  [97] "Std_4DppTzW5" "Std_lk7iRzZJ" "Std_hsrBfRti" "Std_91fMqP9R"
## [101] "Std_FnS1cYdm" "Std_mRNXC26d" "Std_fyTic4NJ" "Std_GZRWoU7J"
## [105] "Std_zNohGEGf" "Std_DxdgMj9o" "Std_NnOh3a5f" "Std_plkfjSYI"
## [109] "Std_VYZJB4hR" "Std_dsBgq3Je" "Std_k6IwmB3w" "Std_4AkdMlRY"
## [113] "Std_KaenyQ09" "Std_tX4A30cP" "Std_4uprNBBU" "Std_g62axQwR"
## [117] "Std_byekg9XZ" "Std_AhIP0218" "Std_MRCmP90i" "Std_PDw9Qaqq"
## [121] "Std_xRj6WabD" "Std_xiRRtWp2" "Std_OZfqbKa2" "Std_hB9pGzy8"
## [125] "Std_jwVL7IEQ" "Std_mGkyTO1S" "Std_MzRP2CUX" "Std_y7JsFbNB"
## [129] "Std_d57vCthw" "Std_cpdkqka4" "Std_aDRQYEJA" "Std_VSauewtT"
## [133] "Std_YRHctdDK" "Std_0dBvru4j" "Std_ntWT94S4" "Std_SQBxPFsh"
## [137] "Std_Jj5evFDR" "Std_mAwpSwBJ" "Std_Tu4f8cY7" "Std_e0tpV9QK"
## [141] "Std_zrSpBR7Q" "Std_aRSEtXbD" "Std_eQovDY6x" "Std_r8OFGoR2"
## [145] "Std_Nb05wCJC" "Std_jpGRMLJH" "Std_Lkr10s12" "Std_C2wSvW34"
## [149] "Std_6n8Gk3sP" "Std_sMVaeGDR" "Std_MnsOH7Mp" "Std_fcbS8I9K"
## [153] "Std_j8jpTZ4Y" "Std_Y8VEoBpi" "Std_lIEaUAzN" "Std_2cPZ3kIC"
## [157] "Std_hSkOIUZ6" "Std_wmlxyFvq" "Std_j4KpJlWe" "Std_RGmdHih0"
## [161] "Std_IozpePFm" "Std_4hesXShF" "Std_csSVIAoc" "Std_aWh1P2cJ"
## [165] "Std_j3KUc1uM" "Std_I5aGgrf4" "Std_KKNXEfyd" "Std_pyHhznvl"
## [169] "Std_3EEVDkjN" "Std_sKBKuq5g" "Std_vx2grlHl" "Std_xwn6VArZ"
## [173] "Std_DA06f58s" "Std_uhGAwMPw" "Std_SQVuPeQ1" "Std_uj41dzuN"
## [177] "Std_wpuMuSij" "Std_fm2ywxxl" "Std_xpiYkmCh" "Std_yBws4BiY"
## [181] "Std_TsywmYZg" "Std_LNA4lJ8N" "Std_0AvcpNFm" "Std_nBMVAIO6"
## [185] "Std_1yEVgMXy" "Std_88nnkx0P" "Std_6BNR6DsU" "Std_eYFep7VB"
## [189] "Std_5G25hVGp" "Std_RjgbPjlb" "Std_jgfbvMOh" "Std_S5mHQTqK"
## [193] "Std_FqGHz1Sa" "Std_XuHQprh3" "Std_bKJyajv0" "Std_oYA8I5bw"
## [197] "Std_jnXsNANY" "Std_jsVspOQx" "Std_cBIStID6" "Std_AMOVmmxy"
## [201] "Std_H1zJz3op" "Std_H3iUD7ET" "Std_bR83G1Qs" "Std_XHGZ6EVN"
## [205] "Std_vKtLaAHQ" "Std_788UIJbt" "Std_gagMA5hd" "Std_gyxsrMmB"
## [209] "Std_Ob9TgiI1" "Std_lvNKVGyp" "Std_Ifs8AaPB" "Std_UYaJ3xHb"
## [213] "Std_UnrTzkrZ" "Std_EsedhxrC" "Std_VdRGIO5q" "Std_2mpB1tpE"
## [217] "Std_LlOxCAez" "Std_7aaFBaoE" "Std_SCqjMEvU" "Std_2dYSEHxM"
## [221] "Std_s85NvPii" "Std_ZBrQc9PQ" "Std_PfI9dOa2" "Std_M904Wobq"
## [225] "Std_AqLDZ7Vz" "Std_5iwMi70u" "Std_JOyB2Sxm" "Std_siodWp7J"
## [229] "Std_yoYDFUFU" "Std_sk8zbBg2" "Std_JLxrNUhh" "Std_cE5iOIrh"
## [233] "Std_5uORBOgC" "Std_tmoEoTsD" "Std_JmgvWntD" "Std_CeNH3h8e"
## [237] "Std_HgZWwcC4" "Std_z7N1s1nD" "Std_Gewk0K5X" "Std_PtCmYezw"
## [241] "Std_VwPqdEhP" "Std_gme04CnP" "Std_1YAUQ71d" "Std_d5S8BOZW"
## [245] "Std_NnshVSF5" "Std_u2xsXoDz" "Std_FRlaGFDL" "Std_PwofxBle"
## [249] "Std_pkisbV1K" "Std_3e5u0c9P" "Std_trpvC0ah" "Std_uXS4mZaS"
## [253] "Std_dbeNieJf" "Std_ZtVMqFqK" "Std_NG3kfGk0" "Std_giNertrJ"
## [257] "Std_v3TRPKNK" "Std_18cDFx0W" "Std_4IspQVnT" "Std_C5rTMwB9"
## [261] "Std_qa6vXrb1" "Std_Pirk0D4v" "Std_RZ5wjihE" "Std_bu4jwuxx"
## [265] "Std_813s6fDn" "Std_LQhN2lAN" "Std_W0IeBITj" "Std_CiTFHFQv"
## [269] "Std_gTe3GzJq" "Std_RyUebb9P" "Std_kcCBpj85" "Std_YjrcGeOq"
## [273] "Std_4yMaUjU4" "Std_FbaSzFX8" "Std_vQ6git2q" "Std_y7CExnbl"
## [277] "Std_UrwvIkif" "Std_DHMPWMOX" "Std_eLu0eVwW" "Std_tjQFNhz6"
## [281] "Std_14kBczR4" "Std_ZsGlQZKk" "Std_feAt8ChY" "Std_V4vPOv7h"
## [285] "Std_ednBy1Rm" "Std_mi8OEAfM" "Std_uLW5Uqaf" "Std_e6Q9Gisr"
## [289] "Std_DiHEt721" "Std_ZxJV8Gx5" "Std_RVUaZljc" "Std_62X7yKgf"
## [293] "Std_3RVWQ982" "Std_tcQptViN" "Std_JU3iDXrW" "Std_DvWeDV2R"
## [297] "Std_KP2t3jbp" "Std_ySMsLvYB" "Std_4bDhaAWE" "Std_kc06qKHM"
## [301] "Std_IgwEiD0o" "Std_irndqKjH" "Std_bmt3qkUp" "Std_hWFl2SPm"
## [305] "Std_uoz4eNGQ" "Std_PgNJGmdI" "Std_88fMOMgP" "Std_HPSbQFn9"
## [309] "Std_2rjlKeP5" "Std_F1xcDDcG" "Std_mUdkMeo5" "Std_mIjd9Xfd"
## [313] "Std_BuzvhdQb" "Std_Z07raJMo" "Std_eYWuuy3e" "Std_UtBQV4H5"
## [317] "Std_X4O0mOvT" "Std_hVAGDNXh" "Std_glPFRFxW" "Std_2PSe71wc"
## [321] "Std_eibHGxXk" "Std_DdcMKGXE" "Std_dAg187gH" "Std_kI1a5Vj2"
## [325] "Std_nV1HlgvZ" "Std_28AVesSp" "Std_LGzbyvli" "Std_I2wPy46g"
## [329] "Std_mIHPhCqw" "Std_TNUSybA2" "Std_WCrvB03U" "Std_Z4iR3nL2"
## [333] "Std_dtG2XZi5" "Std_VeRClqn3" "Std_HYrAtrvl" "Std_5nLW9iZI"
## [337] "Std_EtP9gQsS" "Std_ORbHnxjf" "Std_yHpfNLHQ" "Std_dDrhx49g"
## [341] "Std_ezajrbTc" "Std_CeyBsyIE" "Std_6ylFi2T4" "Std_sjhinntD"
## [345] "Std_FIgieF18" "Std_kDY7iWqV" "Std_2azaNCCZ" "Std_NGcQpHxQ"
## [349] "Std_03qE3rus" "Std_QGFq08Xc" "Std_dxAwD4J4" "Std_BEz71uYn"
## [353] "Std_aURD8eFN" "Std_nF1X5F2t" "Std_emsp91h5" "Std_bwhQgSPL"
## [357] "Std_az2adBbj" "Std_ojGVU7q1" "Std_GV3pvfRF" "Std_3r27MwJ0"
## [361] "Std_BPqMChKa" "Std_xkK15XA1" "Std_kodKaXdo" "Std_H3SYaxWv"
## [365] "Std_CZw7YuEu" "Std_sbpNGnnF" "Std_tiCw7g9Y" "Std_q9vkWMDZ"
## [369] "Std_Gn36kqhC" "Std_1W8euIHO" "Std_FFg33ufk" "Std_UPVMxxlW"
## [373] "Std_OVWzFQKI" "Std_9e9OxIM6" "Std_IJ1NWUuY" "Std_NsOadrDc"
## [377] "Std_llrVN7KK" "Std_i1M8tttb" "Std_7N4ljVlI" "Std_STYNPwe0"
## [381] "Std_hgvj2v8R" "Std_vJAIqGmD" "Std_EyLJLzrz" "Std_kjWmincI"
## [385] "Std_Oh91eGFh" "Std_iPvfsvpe" "Std_P0KYrH0c"
#Check that "code.subject" list extracted from the dataset matches allowed terms in codebook.
unique(base_data$code.subject)
##  [1] "3. Technical and Scientific" "2. Writing Quality"         
##  [3] "4. Logic and Thinking"       "5. Praise or Concern"       
##  [5] "12. Narrative Comments"      "1. Basic Criteria"          
##  [7] "6. Misconduct"               "11. FLAG"                   
##  [9] "8. No basis to judge"        "Scientific Name"            
## [11] "7. Policy, Administrative"   "9. SPLIT"
#Check that the "code.structure" list extracted from the dataset matches allowed terms in codebook.
unique(base_data$code.structure)
## [1] "Copy Correction"   "Specific info"     "General info"     
## [4] "Holistic"          "Pointer"           "No basis to judge"
## [7] "Idiomatic"         NA                  "GEneral info"
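The checks above are read by eye. The count that the student-ID check asks for, and the vocabulary checks, can also be computed directly; a minimal sketch in base R (the `allowed_labs` vector is a hypothetical stand-in for the codebook's keyword list):

```r
# Count unique anonymous student IDs directly instead of scanning the
# printed vector; compare against combined enrollment.
n_students <- length(unique(base_data$student))
n_students

# Hypothetical guard: stop early if any lab keyword falls outside the
# codebook list (allowed_labs typed in from the codebook).
allowed_labs <- c("allocation", "betta", "manduca", "frog",
                  "photosynthesis", "physarum", NA)
stopifnot(all(unique(base_data$lab) %in% allowed_labs))
```

The same `stopifnot()` pattern extends to the code.subject and code.structure vocabularies; a failed check halts the script at import rather than partway through analysis.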


Log-Odds Approach to Text Comparison

I evaluated several iterations of different analysis strategies. The log-odds ratios method outlined below has been the most informative to date.

# Isolate table rows to compare. Then reduce the larger dataframe to the 3 required columns (unique.record, ta.comment, code.subject).

frequency_writing <- filter(base_data, code.subject=="2. Writing Quality"|code.subject=="3. Technical and Scientific"|code.subject=="4. Logic and Thinking")
frequency_writing.subcolumns <- frequency_writing %>% select(1, 22:23)
# Tokenize phrases, remove stop words listed in standard reference file.
base_data_tokenized <- frequency_writing.subcolumns %>% 
  unnest_tokens(word,ta.comment) %>% 
  anti_join(stop_words)
## Joining, by = "word"

At this point the dataframe should have “unique.record” in column 1, “code.subject” in column 2, and the unnested “word” in column 3. All columns are class = “character”. Stop words listed in the “stop_words” R reference data file have been removed.

Comparing Pearson Correlations of Word Frequencies

If the words used by TAs in their comments differ between the categories, there should be fairly low correlations.

#Block below re-organizes the data for comparisons. This pattern sets "2. Writing Quality" as the first dataset, then all other data in remaining columns.
#Groups by code.subject, calculates proportional frequency in each group.
#Final step creates 4 columns: word, 2. Writing Quality, code.subject, proportion.
#Stats need this format to compare 2. Writing Quality against the other values.
base_data_tokenized_sorted2 <- base_data_tokenized %>% 
  group_by(code.subject) %>%
  count(code.subject, word, sort=TRUE) %>% 
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  spread(code.subject, proportion) %>% 
  gather(code.subject, proportion, `3. Technical and Scientific`:`4. Logic and Thinking`)

The dataframe called “base_data_tokenized_sorted2” is the final 4-column dataframe for statistical analysis. Individual words are in column 1, “2. Writing Quality” frequency is in column 2, “code.subject” is in column 3, and “proportion” is in column 4. Columns 1 and 3 are class = “character”; columns 2 and 4 are numeric.

This structure is needed in order to compare word frequencies for 2. Writing Quality (Column 2) as Y values against the frequencies of those same words (Column 4) as X values. Which subset of X values to use is coded in Column 3.
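The spread-then-gather pattern that produces this shape is easier to see on a toy example; a minimal sketch with invented words and proportions, assuming tidyr and tibble are loaded via tidyverse:

```r
library(tidyr)

# Toy version of the reshaping: two subject groups, three word rows.
toy <- data.frame(
  code.subject = c("2. Writing Quality", "2. Writing Quality",
                   "3. Technical and Scientific"),
  word         = c("comma", "cite", "cite"),
  proportion   = c(0.6, 0.4, 1.0),
  stringsAsFactors = FALSE
)

# spread() makes one column per subject; gather() then folds every subject
# EXCEPT the reference column back into key/value pairs, yielding the
# 4-column shape: word, reference column, code.subject, proportion.
toy_wide <- spread(toy, code.subject, proportion)
gather(toy_wide, code.subject, proportion, `3. Technical and Scientific`)
```

The key point is that the reference column (here `2. Writing Quality`) must be left out of the gather so it survives as its own numeric column.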

Calculate the Pearson correlation between the frequency of words in Writing Quality and each of the other two groups.

cor.test(data = base_data_tokenized_sorted2[base_data_tokenized_sorted2$code.subject == "3. Technical and Scientific",], 
         ~ proportion + `2. Writing Quality`)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and 2. Writing Quality
## t = 15.152, df = 1176, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3552434 0.4508669
## sample estimates:
##       cor 
## 0.4041588
cor.test(data = base_data_tokenized_sorted2[base_data_tokenized_sorted2$code.subject == "4. Logic and Thinking",], 
         ~ proportion + `2. Writing Quality`)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and 2. Writing Quality
## t = 20.759, df = 946, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5140770 0.6016765
## sample estimates:
##       cor 
## 0.5594369

ALL WORKS FINE TO HERE.


START OF BROKEN SECTION

# Isolate table rows to compare. Then reduce the larger dataframe to the 3 required columns (unique.record, ta.comment, code.subject).

frequency_writing3 <- filter(base_data, code.subject=="3. Technical and Scientific"|code.subject=="2. Writing Quality"|code.subject=="4. Logic and Thinking")
frequency_writing3.subcolumns <- frequency_writing3 %>% select(1, 22:23)
# Tokenize phrases, remove stop words listed in standard reference file.
base_data_tokenized3 <- frequency_writing3.subcolumns %>% 
  unnest_tokens(word,ta.comment) %>% 
  anti_join(stop_words)
## Joining, by = "word"
#Block below re-organizes the data for comparisons. This pattern is meant to set "3. Technical and Scientific" as the first dataset, then all other data in remaining columns.
#Groups by code.subject, calculates proportional frequency in each group.
#Final step should create 4 columns: word, 3. Technical and Scientific, code.subject, proportion.
#Stats need this format to compare 3. Technical and Scientific against the other values.
base_data_tokenized_sorted3 <- base_data_tokenized3 %>% 
  group_by(code.subject) %>%
  count(code.subject, word, sort=TRUE) %>% 
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  spread(code.subject, proportion) %>% 
  gather(code.subject, proportion, `2. Writing Quality`:`4. Logic and Thinking`)
#Groups by code.subject, then calculates proportional frequency in each group.
#Final step creates 5 columns: code.subject, word, n, total, freq.
base_data_tokenized_sorted3 <- base_data_tokenized3 %>% 
  group_by(code.subject) %>%
  count(code.subject, word, sort=TRUE) %>% 
  left_join(base_data_tokenized3 %>% 
              group_by(code.subject) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total)
## Joining, by = "code.subject"

Re-organize the data for comparisons. This pattern sets 3. Technical and Scientific as the first dataset, then all other data in remaining columns.

#Groups by code.subject, calculates proportional frequency in each group.
#Final step creates 4 columns: word, 3. Technical and Scientific, code.subject, proportion.
#Stats need this format to compare 3. Technical and Scientific against the other values.
base_data_tokenized_sorted3 <- base_data_tokenized3 %>% 
  group_by(code.subject) %>%
  count(code.subject, word, sort=TRUE) %>% 
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  spread(code.subject, proportion) %>% 
  gather(code.subject, proportion, `2. Writing Quality`:`4. Logic and Thinking`)

# THERE IS AN ERROR HERE I CANNOT DEBUG! If I change "3. Technical and Scientific"  to "2. Writing Quality", the structure of the tibble breaks. It goes from the correctly sorted 4 columns to incorrectly sorted 3 columns. It is NOT A TYPO; I cannot use a different order, or any other format. 
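A likely cause of the breakage described above (offered as a diagnosis, not a tested fix against the real data): `gather()` given a range like `` `2. Writing Quality` ``:`` `4. Logic and Thinking` `` selects every column positioned between the two endpoints, so the middle column `3. Technical and Scientific` gets swept into the key/value pair as well and no reference column survives — exactly the observed collapse from 4 columns to 3. Selecting the gather columns by exclusion instead of by positional range should keep whichever reference column is chosen:

```r
library(dplyr)
library(tidyr)

# Hypothetical fix: gather everything EXCEPT the word column and the
# chosen reference column, instead of naming a positional range.
base_data_tokenized_sorted3 <- base_data_tokenized3 %>% 
  count(code.subject, word, sort = TRUE) %>% 
  group_by(code.subject) %>%
  mutate(proportion = n / sum(n)) %>% 
  ungroup() %>%
  select(-n) %>% 
  spread(code.subject, proportion) %>% 
  gather(code.subject, proportion,
         -word, -`3. Technical and Scientific`)
```

Swapping which column is excluded (e.g. `` -`2. Writing Quality` ``) should then change the reference dataset without collapsing the tibble.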

END OF BROKEN SECTION


START OF SECTION I CANNOT COPY AND MODIFY. This section depends on the ability to calculate tables as described above.

#Groups by code.subject, then calculates proportional frequency in each group.
#Final step creates 5 columns: code.subject, word, n, total, freq.
base_data_tokenized_sorted2 <- base_data_tokenized %>% 
  group_by(code.subject) %>%
  count(code.subject, word, sort=TRUE) %>% 
  left_join(base_data_tokenized %>% 
  group_by(code.subject) %>% 
  summarise(total = n())) %>%
  mutate(freq = n/total)
## Joining, by = "code.subject"
base_data_tokenized_sorted2v3 <- base_data_tokenized_sorted2 %>% 
  select(code.subject, word, freq) %>% 
  spread(code.subject, freq) %>%
  arrange(`2. Writing Quality`,`3. Technical and Scientific`)
ggplot(base_data_tokenized_sorted2v3, aes(`2. Writing Quality`,`3. Technical and Scientific`)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")
## Warning: Removed 2946 rows containing missing values (geom_point).
## Warning: Removed 2946 rows containing missing values (geom_text).

  ggsave("Freqplot_3vs2.png", width = 10, height = 6)
## Warning: Removed 2946 rows containing missing values (geom_point).

## Warning: Removed 2946 rows containing missing values (geom_text).
#I can get part way around the problem by re-using the same data and calling a different subset for the Y axis. I want to add comparisons of 3 vs 4, but the same tables cannot be created because the dataframe keeps breaking when I try to change 2. Writing Quality.

base_data_tokenized_sorted2v4 <- base_data_tokenized_sorted2 %>% 
  select(code.subject, word, freq) %>% 
  spread(code.subject, freq) %>%
  arrange(`2. Writing Quality`,`4. Logic and Thinking`)
ggplot(base_data_tokenized_sorted2v4, aes(`2. Writing Quality`,`4. Logic and Thinking`)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")
## Warning: Removed 3176 rows containing missing values (geom_point).
## Warning: Removed 3176 rows containing missing values (geom_text).

  ggsave("Freqplot_4vs2.png", width = 10, height = 6)
## Warning: Removed 3176 rows containing missing values (geom_point).

## Warning: Removed 3176 rows containing missing values (geom_text).

END of broken / limping section


Calculating and Plotting Log-Odds Pairs

2. Writing versus 3. Technical

word_ratios2v3 <- base_data_tokenized %>%
  count(word, code.subject) %>%
  group_by(word) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  spread(code.subject, n, fill = 0) %>%
  mutate_if(is.numeric, funs((. + 1) / (sum(.) + 1))) %>%
  mutate(logratio = log(`2. Writing Quality`/`3. Technical and Scientific`)) %>%
  arrange(desc(logratio))
word_ratios2v3 %>% 
  arrange(abs(logratio))
## # A tibble: 871 x 5
##    word    `2. Writing Qual… `3. Technical and… `4. Logic and Th… logratio
##    <chr>               <dbl>              <dbl>             <dbl>    <dbl>
##  1 parent…          0.00110            0.00110           0.000149  7.81e-4
##  2 organi…          0.00130            0.00130           0.000744 -2.06e-3
##  3 helpful          0.000700           0.000692          0.000595  1.14e-2
##  4 specif…          0.001000           0.00102           0.00417  -1.76e-2
##  5 cell             0.000600           0.000610          0.000893 -1.76e-2
##  6 claims           0.000200           0.000203          0.00119  -1.76e-2
##  7 commen…          0.000400           0.000407          0.000595 -1.76e-2
##  8 glucose          0.000200           0.000203          0.000893 -1.76e-2
##  9 incuba…          0.000400           0.000407          0.000149 -1.76e-2
## 10 inhibi…          0.000400           0.000407          0.00119  -1.76e-2
## # ... with 861 more rows
word_ratios2v3 %>%
  group_by(logratio < 0) %>%
  top_n(20, abs(logratio)) %>%
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("log odds ratio (Writing-top/Technical-bottom)") +
  scale_fill_discrete(name = "", labels = c("2. Writing Quality", "3. Technical and Scientific"))

  ggsave("Logplot_2vs3.png", width = 6, height = 10)
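The smoothing step in the mutate above, `(. + 1) / (sum(.) + 1)`, is add-one (Laplace) smoothing: words with a zero count in one category still get a small nonzero proportion, so the log ratio stays finite. A worked base-R sketch with invented counts:

```r
# Invented counts for three words in two categories, after spread(fill = 0).
writing_counts   <- c(5, 0, 3)
technical_counts <- c(2, 4, 3)

# Add-one smoothing: zero counts become small nonzero proportions.
p_writing   <- (writing_counts + 1) / (sum(writing_counts) + 1)
p_technical <- (technical_counts + 1) / (sum(technical_counts) + 1)

# Positive log ratio: word leans toward "Writing"; negative: "Technical".
logratio <- log(p_writing / p_technical)
round(logratio, 3)
```

Without the `+ 1`, the second word (count 0 in Writing) would produce `log(0) = -Inf` and drop out of the plots.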

3. Technical vs. 4. Logic

word_ratios3v4 <- base_data_tokenized %>%
  count(word, code.subject) %>%
  group_by(word) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  spread(code.subject, n, fill = 0) %>%
  mutate_if(is.numeric, funs((. + 1) / (sum(.) + 1))) %>%
  mutate(logratio = log(`3. Technical and Scientific`/`4. Logic and Thinking`)) %>%
  arrange(desc(logratio))
word_ratios3v4 %>% 
  arrange(abs(logratio))
## # A tibble: 871 x 5
##    word        `2. Writing Qua… `3. Technical a… `4. Logic and T… logratio
##    <chr>                  <dbl>            <dbl>            <dbl>    <dbl>
##  1 clarify            0.000600          0.000448         0.000446  0.00291
##  2 effective          0.000400          0.000448         0.000446  0.00291
##  3 independent        0.000400          0.000448         0.000446  0.00291
##  4 instar             0.0001000         0.000448         0.000446  0.00291
##  5 ions               0.0001000         0.000448         0.000446  0.00291
##  6 leaves             0.0001000         0.000448         0.000446  0.00291
##  7 meant              0.000600          0.000448         0.000446  0.00291
##  8 putting            0.000200          0.000448         0.000446  0.00291
##  9 stock              0.0001000         0.000448         0.000446  0.00291
## 10 contraction        0.00230           0.00374          0.00372   0.00654
## # ... with 861 more rows
word_ratios3v4 %>%
  group_by(logratio < 0) %>%
  top_n(20, abs(logratio)) %>%
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("log odds ratio (Technical-top/Logic-bottom)") +
  scale_fill_discrete(name = "", labels = c("3. Technical and Scientific","4. Logic and Thinking"))

  ggsave("Logplot_3vs4.png", width = 6, height = 10)

2. Writing versus 4. Logic

word_ratios2v4 <- base_data_tokenized %>%
  count(word, code.subject) %>%
  group_by(word) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  spread(code.subject, n, fill = 0) %>%
  mutate_if(is.numeric, funs((. + 1) / (sum(.) + 1))) %>%
  mutate(logratio = log(`2. Writing Quality`/`4. Logic and Thinking`)) %>%
  arrange(desc(logratio))
word_ratios2v4 %>% 
  arrange(abs(logratio))
## # A tibble: 871 x 5
##    word       `2. Writing Qual… `3. Technical a… `4. Logic and T… logratio
##    <chr>                  <dbl>            <dbl>            <dbl>    <dbl>
##  1 adding              0.000900         0.000407         0.000893  0.00782
##  2 al                  0.000300         0.000366         0.000298  0.00782
##  3 attention           0.000300         0.000407         0.000298  0.00782
##  4 bio                 0.000300         0.000488         0.000298  0.00782
##  5 commitment          0.000600         0.000366         0.000595  0.00782
##  6 conclusion          0.00150          0.000203         0.00149   0.00782
##  7 condition           0.000300         0.000977         0.000298  0.00782
##  8 days                0.000300         0.000773         0.000298  0.00782
##  9 difficult           0.000300         0.000366         0.000298  0.00782
## 10 igr                 0.000600         0.00147          0.000595  0.00782
## # ... with 861 more rows
word_ratios2v4 %>%
  group_by(logratio < 0) %>%
  top_n(20, abs(logratio)) %>%
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("log odds ratio (Writing-top/Logic-bottom)") +
  scale_fill_discrete(name = "", labels = c("2. Writing Quality", "4. Logic and Thinking"))

  ggsave("Logplot_2vs4.png", width = 6, height = 10)

Other Analysis Approaches

This approach summarizes the word frequencies in the FULL set of TA comments first, then breaks them down into sub-groups. It also demonstrates some alternative code patterns.

The first block takes all comments from base_data, unnests tokens, filters stop words, counts and sorts the remaining terms, then writes the results to a new dataframe called ‘sortedwords.comments.all’.

The TOTAL word count is calculated and stored. Then the fractional frequency of each word is calculated and appended to the dataframe using mutate.

# Tokenize phrases, remove stop words listed in standard reference file.
base_data_tokenized <- base_data %>% 
  unnest_tokens(word,ta.comment)

# An alternative approach to removing stop words  
sortedwords.comments.all <- base_data_tokenized %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]")) %>%
  count(word, sort = TRUE) %>%
  ungroup()

#Calculate the number of words in total in the subset
total_words.all <- sortedwords.comments.all %>% 
  summarize(total = sum(n))

#Mutate table to calculate then append fractional word frequency in the text.
sortedwords.comments.all.fraction <- sortedwords.comments.all %>% as_tibble() %>% mutate(
  all.fraction = (n*100)/total_words.all$total
  )

This block takes subsets of comments from base_data_tokenized, using code.subject as the separating characteristic. It then summarizes the word frequencies in each subset.

#Creates 8 SUBSETs of tokenized data, split out based on contents of "code.subject": 
base_data_tokenized.basic <- subset(base_data_tokenized, code.subject == "1. Basic Criteria", select = (1:28))
base_data_tokenized.writing <- subset(base_data_tokenized, code.subject == "2. Writing Quality", select = (1:28))
base_data_tokenized.technical <- subset(base_data_tokenized, code.subject == "3. Technical and Scientific", select = (1:28))
base_data_tokenized.logic <- subset(base_data_tokenized, code.subject == "4. Logic and Thinking", select = (1:28))
base_data_tokenized.praise <- subset(base_data_tokenized, code.subject == "5. Praise or Concern", select = (1:28))
base_data_tokenized.misconduct <- subset(base_data_tokenized, code.subject == "6. Misconduct", select = (1:28))
base_data_tokenized.narrative <- subset(base_data_tokenized, code.subject == "12. Narrative Comments", select = (1:28))
base_data_tokenized.sciname <- subset(base_data_tokenized, code.subject == "Scientific Name", select = (1:28))
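The eight `subset()` calls above can also be generated in one step with base R's `split()`, which returns a named list keyed by the values of `code.subject`; a minimal sketch:

```r
# Alternative: one named list instead of eight separate variables.
tokenized_by_subject <- split(base_data_tokenized,
                              base_data_tokenized$code.subject)

# Individual subsets are then pulled out by the codebook label, e.g.:
# tokenized_by_subject[["2. Writing Quality"]]
```

The list form also makes the later per-subset counting loop-able with `lapply()`, instead of repeating the same pipeline per variable.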

For each subset, count words and the subset total, then calculate frequencies and append them with mutate.

sortedwords.basic <- base_data_tokenized.basic %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]")) %>%
  count(word, sort = TRUE) %>%
  ungroup()

#Calculate the number of words in total in the subset
total_words.basic <- sortedwords.basic %>% 
  summarize(total = sum(n))

#Mutate table to calculate then append fractional word frequency in the text.
sortedwords.basic.fraction <- sortedwords.basic %>% as_tibble() %>% mutate(
  basic.fraction = (n*100)/total_words.basic$total
)
sortedwords.writing <- base_data_tokenized.writing %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]")) %>%
  count(word, sort = TRUE) %>%
  ungroup()

#Calculate the number of words in total in the subset
total_words.writing <- sortedwords.writing %>% 
  summarize(total = sum(n))

#Mutate table to calculate then append fractional word frequency in the text.
sortedwords.writing.fraction <- sortedwords.writing %>% as_tibble() %>% mutate(
  writing.fraction = (n*100)/total_words.writing$total
)
sortedwords.technical <- base_data_tokenized.technical %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]")) %>%
  count(word, sort = TRUE) %>%
  ungroup()

#Calculate the number of words in total in the subset
total_words.technical <- sortedwords.technical %>% 
  summarize(total = sum(n))

#Mutate table to calculate then append fractional word frequency in the text.
sortedwords.technical.fraction <- sortedwords.technical %>% as_tibble() %>% mutate(
  technical.fraction = (n*100)/total_words.technical$total
)
sortedwords.logic <- base_data_tokenized.logic %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]")) %>%
  count(word, sort = TRUE) %>%
  ungroup()

#Calculate the number of words in total in the subset
total_words.logic <- sortedwords.logic %>% 
  summarize(total = sum(n))

#Mutate table to calculate then append fractional word frequency in the text.
sortedwords.logic.fraction <- sortedwords.logic %>% as_tibble() %>% mutate(
  logic.fraction = (n*100)/total_words.logic$total
)

The next block joins the individual data tables for each subset into a larger table using the column named ‘word’. The joined columns get confusing auto-generated names, so they are renamed.

WARNING: Only words that appear in ALL of the merged sets are retained. If a word is missing from any subset, it is dropped from the merged result. This approach loses the words unique to a single subset, but may be useful in certain situations.
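This pruning is the default inner-join behavior of base R's `merge()`; passing `all = TRUE` performs a full outer join that keeps the unique words, with `NA` in the columns where a word is absent. A minimal sketch with invented words:

```r
# Two toy frequency tables that share only one word.
writing_tab   <- data.frame(word = c("comma", "tense"),
                            writing.count = c(4, 2))
technical_tab <- data.frame(word = c("comma", "buffer"),
                            technical.count = c(3, 5))

# Default merge keeps only the shared word.
merge(writing_tab, technical_tab, by = "word")

# all = TRUE keeps every word; missing counts become NA.
merge(writing_tab, technical_tab, by = "word", all = TRUE)
```

Whether to keep the `NA` rows depends on the downstream step: the log-scale frequency plots drop them anyway, but counts of subset-unique words need the outer join.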

total <- merge(sortedwords.comments.all.fraction,sortedwords.writing.fraction,by="word") 
total2 <- merge(total,sortedwords.technical.fraction,by="word")
total3 <- merge(total2,sortedwords.logic.fraction,by="word")
## Warning in merge.data.frame(total2, sortedwords.logic.fraction, by =
## "word"): column names 'n.x', 'n.y' are duplicated in the result
#Rename columns that had confusing auto-generated names
names(total3)[2] <- "all.count"
names(total3)[4] <- "writing.count"
names(total3)[6] <- "technical.count"
names(total3)[8] <- "logic.count"

#Mutate table to calculate then append fractional word frequency in the text.
total4 <- total3 %>% as_tibble() %>% mutate(
  writing.wt = (writing.count *100)/all.count
)

total5 <- total4 %>% as_tibble() %>% mutate(
  technical.wt = (technical.count *100)/all.count
)

total6 <- total5 %>% as_tibble() %>% mutate(
  logic.wt = (logic.count *100)/all.count
)

The dataframe “total6” can be graphed to show fractional weights of terms in overall dataset versus individual datasets.

ggplot(total6, aes(`all.fraction`,`writing.fraction`)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")

  ggsave("Freqplot_2vsA.png", width = 10, height = 6)
ggplot(total6, aes(`all.fraction`,`technical.fraction`)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")

  ggsave("Freqplot_3vsA.png", width = 10, height = 6)
ggplot(total6, aes(`all.fraction`,`logic.fraction`)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")

  ggsave("Freqplot_4vsA.png", width = 10, height = 6)
ggplot(total6, aes(`writing.fraction`,`technical.fraction`)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")

  ggsave("Freqplot_3vs2.png", width = 10, height = 6)
ggplot(total6, aes(`technical.fraction`,`logic.fraction`)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")

  ggsave("Freqplot_4vs3.png", width = 10, height = 6)
ggplot(total6, aes(`writing.fraction`,`logic.fraction`)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")

  ggsave("Freqplot_4vs2.png", width = 10, height = 6)